The successful completion of this project will help the social media conglomerate make data-driven decisions to enhance its content strategy, prepare for an IPO, and improve its big data practices, ultimately supporting greater success and growth in the highly competitive social media industry.
The first part of the project involves data analysis to understand the content categories that are most popular on the social media platform. This analysis will involve processing large volumes of data to identify patterns, trends, and user preferences. The goal is to determine the top 5 content categories that have the highest aggregate popularity based on metrics such as likes, shares, comments, and engagement.
The expected deliverables are: identification of the top 5 content categories with the highest aggregate popularity on the social media platform; insights into user preferences, engagement patterns, and content trends that can inform the company's content strategy; and recommendations on how to optimize content creation and promotion to increase user engagement and retention.
The raw data used in this project is available on GitHub.
#all the libraries to be used will be imported first:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
import datetime
#the next step is to examine the content table for details regarding the contents that were uploaded:
df1 = pd.read_csv('Content.csv')
df1
| | Unnamed: 0 | Content ID | User ID | Type | Category | URL |
|---|---|---|---|---|---|---|
| 0 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | 8d3cd87d-8a31-4935-9a4f-b319bfe05f31 | photo | Studying | https://socialbuzz.cdn.com/content/storage/975... |
| 1 | 1 | 9f737e0a-3cdd-4d29-9d24-753f4e3be810 | beb1f34e-7870-46d6-9fc7-2e12eb83ce43 | photo | healthy eating | https://socialbuzz.cdn.com/content/storage/9f7... |
| 2 | 2 | 230c4e4d-70c3-461d-b42c-ec09396efb3f | a5c65404-5894-4b87-82f2-d787cbee86b4 | photo | healthy eating | https://socialbuzz.cdn.com/content/storage/230... |
| 3 | 3 | 356fff80-da4d-4785-9f43-bc1261031dc6 | 9fb4ce88-fac1-406c-8544-1a899cee7aaf | photo | technology | https://socialbuzz.cdn.com/content/storage/356... |
| 4 | 4 | 01ab84dd-6364-4236-abbb-3f237db77180 | e206e31b-5f85-4964-b6ea-d7ee5324def1 | video | food | https://socialbuzz.cdn.com/content/storage/01a... |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 995 | b4cef9ef-627b-41d7-a051-5961b0204ebb | 5b62e10e-3c19-4d28-a57c-e9bdc3d6758d | video | public speaking | NaN |
| 996 | 996 | 7a79f4e4-3b7d-44dc-bdef-bc990740252c | 4fe420fa-a193-4408-bd5d-62a020233609 | GIF | technology | https://socialbuzz.cdn.com/content/storage/7a7... |
| 997 | 997 | 435007a5-6261-4d8b-b0a4-55fdc189754b | 35d6a1f3-e358-4d4b-8074-05f3b7f35c2a | audio | veganism | https://socialbuzz.cdn.com/content/storage/435... |
| 998 | 998 | 4e4c9690-c013-4ee7-9e66-943d8cbd27b7 | b9bcd994-f000-4f6b-87fc-caae08acfaa1 | GIF | culture | https://socialbuzz.cdn.com/content/storage/4e4... |
| 999 | 999 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | b8c653b5-0118-4d7e-9bde-07c2de90f0ff | audio | technology | https://socialbuzz.cdn.com/content/storage/75d... |
1000 rows × 6 columns
#because we do not need the 'URL' and 'User ID' details, we are going to drop them:
df1d = df1.drop(columns = ['URL','User ID'])
df1d.head(5)
| | Unnamed: 0 | Content ID | Type | Category |
|---|---|---|---|---|
| 0 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | Studying |
| 1 | 1 | 9f737e0a-3cdd-4d29-9d24-753f4e3be810 | photo | healthy eating |
| 2 | 2 | 230c4e4d-70c3-461d-b42c-ec09396efb3f | photo | healthy eating |
| 3 | 3 | 356fff80-da4d-4785-9f43-bc1261031dc6 | photo | technology |
| 4 | 4 | 01ab84dd-6364-4236-abbb-3f237db77180 | video | food |
print(df1d.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1000 non-null   int64
 1   Content ID  1000 non-null   object
 2   Type        1000 non-null   object
 3   Category    1000 non-null   object
dtypes: int64(1), object(3)
memory usage: 31.4+ KB
None
#we need to list the unique values in the Category column to see where each piece of content belongs:
df1d['Category'].unique()
array(['Studying', 'healthy eating', 'technology', 'food', 'cooking',
'dogs', 'soccer', 'public speaking', 'science', 'tennis', 'travel',
'fitness', 'education', 'studying', 'veganism', 'Animals',
'animals', 'culture', '"culture"', 'Fitness', '"studying"',
'Veganism', '"animals"', 'Travel', '"soccer"', 'Education',
'"dogs"', 'Technology', 'Soccer', '"tennis"', 'Culture', '"food"',
'Food', '"technology"', 'Healthy Eating', '"cooking"', 'Science',
'"public speaking"', '"veganism"', 'Public Speaking', '"science"'],
dtype=object)
#the result of the unique category listing needs to be cleaned: the letter cases are mixed and the quotation marks need to be removed
#we are going to unify the letter case using the .str.lower() method:
df1d['Category'] = df1d['Category'].str.lower()
df1d['Category'].unique()
array(['studying', 'healthy eating', 'technology', 'food', 'cooking',
'dogs', 'soccer', 'public speaking', 'science', 'tennis', 'travel',
'fitness', 'education', 'veganism', 'animals', 'culture',
'"culture"', '"studying"', '"animals"', '"soccer"', '"dogs"',
'"tennis"', '"food"', '"technology"', '"cooking"',
'"public speaking"', '"veganism"', '"science"'], dtype=object)
#the next action is to remove the quotation marks; a single vectorized .str.replace() handles every quoted label at once:
df1d['Category'] = df1d['Category'].str.replace('"', '', regex=False)
df1d['Category'].unique()
array(['studying', 'healthy eating', 'technology', 'food', 'cooking',
'dogs', 'soccer', 'public speaking', 'science', 'tennis', 'travel',
'fitness', 'education', 'veganism', 'animals', 'culture'],
dtype=object)
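For reference, the cleanup steps above (lowercasing and quote-stripping, plus whitespace-trimming as an extra precaution not used in this notebook) can be chained into one vectorized pipeline. A minimal sketch on an illustrative sample of the raw labels:

```python
import pandas as pd

# Illustrative sample of the raw labels seen in Content.csv (not the full set)
raw = pd.Series(['Studying', '"studying"', 'Healthy Eating', '"culture"', 'culture'],
                name='Category')

# One chained pipeline: lowercase, strip the quotation marks, trim whitespace
cleaned = (raw.str.lower()
              .str.replace('"', '', regex=False)
              .str.strip())

print(sorted(cleaned.unique()))  # → ['culture', 'healthy eating', 'studying']
```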
#now that the Category column has been cleaned, we are going to check for null values:
df1d.isnull().sum()
Unnamed: 0    0
Content ID    0
Type          0
Category      0
dtype: int64
#we are going to change the column name from Type to ContentType for readability:
df1d.rename(columns = {'Type':'ContentType'}, inplace = True)
df1d.head(5)
| | Unnamed: 0 | Content ID | ContentType | Category |
|---|---|---|---|---|
| 0 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying |
| 1 | 1 | 9f737e0a-3cdd-4d29-9d24-753f4e3be810 | photo | healthy eating |
| 2 | 2 | 230c4e4d-70c3-461d-b42c-ec09396efb3f | photo | healthy eating |
| 3 | 3 | 356fff80-da4d-4785-9f43-bc1261031dc6 | photo | technology |
| 4 | 4 | 01ab84dd-6364-4236-abbb-3f237db77180 | video | food |
#the next table we are going to work with in this project is the Reactions table, which gives details of the reactions to each piece of content posted:
df2 = pd.read_csv('Reactions.csv')
df2
| | Unnamed: 0 | Content ID | User ID | Type | Datetime |
|---|---|---|---|---|---|
| 0 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | NaN | NaN | 2021-04-22 15:17:15 |
| 1 | 1 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | 5d454588-283d-459d-915d-c48a2cb4c27f | disgust | 2020-11-07 09:43:50 |
| 2 | 2 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | 92b87fa5-f271-43e0-af66-84fac21052e6 | dislike | 2021-06-17 12:22:51 |
| 3 | 3 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | 163daa38-8b77-48c9-9af6-37a6c1447ac2 | scared | 2021-04-18 05:13:58 |
| 4 | 4 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | 34e8add9-0206-47fd-a501-037b994650a2 | disgust | 2021-01-06 19:13:01 |
| ... | ... | ... | ... | ... | ... |
| 25548 | 25548 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | 80c9ce48-46f9-4f5e-b3ca-3b698fc2e949 | dislike | 2020-06-27 09:46:48 |
| 25549 | 25549 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | 2bd9c167-e06c-47c1-a978-3403d6724606 | intrigued | 2021-02-16 17:17:02 |
| 25550 | 25550 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | NaN | interested | 2020-09-12 03:54:58 |
| 25551 | 25551 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | 5ffd8b51-164e-47e2-885e-8b8c46eb63ed | worried | 2020-11-04 20:08:31 |
| 25552 | 25552 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | 4edc3d1a-a7d9-4db6-89c3-f784d9954172 | cherish | 2021-01-04 04:55:11 |
25553 rows × 5 columns
#we are going to check for null values in the table:
df2.isnull().sum()
Unnamed: 0       0
Content ID       0
User ID       3019
Type           980
Datetime         0
dtype: int64
#we can drop the null values using the .dropna() command:
df2.dropna(inplace = True)
df2.isnull().sum()
Unnamed: 0    0
Content ID    0
User ID       0
Type          0
Datetime      0
dtype: int64
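One caveat worth flagging: .dropna() with no arguments removes any row containing a null in any column. Since the 'User ID' column is dropped from this table immediately afterwards, restricting the null check to the columns actually kept would retain more usable reactions; this is a judgment call, sketched below on toy rows (the values are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy reactions table: the second row is missing only the User ID
df = pd.DataFrame({
    'User ID': ['u1', np.nan, 'u2'],
    'Type':    ['like', 'heart', np.nan],
})

# Dropping on every column loses the second row too, even though its
# reaction type is still usable once 'User ID' is discarded
strict = df.dropna()
targeted = df.dropna(subset=['Type'])
print(len(strict), len(targeted))  # → 1 2
```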
df2
| | Unnamed: 0 | Content ID | User ID | Type | Datetime |
|---|---|---|---|---|---|
| 1 | 1 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | 5d454588-283d-459d-915d-c48a2cb4c27f | disgust | 2020-11-07 09:43:50 |
| 2 | 2 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | 92b87fa5-f271-43e0-af66-84fac21052e6 | dislike | 2021-06-17 12:22:51 |
| 3 | 3 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | 163daa38-8b77-48c9-9af6-37a6c1447ac2 | scared | 2021-04-18 05:13:58 |
| 4 | 4 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | 34e8add9-0206-47fd-a501-037b994650a2 | disgust | 2021-01-06 19:13:01 |
| 5 | 5 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | 9b6d35f9-5e15-4cd0-a8d7-b1f3340e02c4 | interested | 2020-08-23 12:25:58 |
| ... | ... | ... | ... | ... | ... |
| 25547 | 25547 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | b6d04982-1509-41ab-a700-b390d6cb4d02 | worried | 2020-10-31 04:50:14 |
| 25548 | 25548 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | 80c9ce48-46f9-4f5e-b3ca-3b698fc2e949 | dislike | 2020-06-27 09:46:48 |
| 25549 | 25549 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | 2bd9c167-e06c-47c1-a978-3403d6724606 | intrigued | 2021-02-16 17:17:02 |
| 25551 | 25551 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | 5ffd8b51-164e-47e2-885e-8b8c46eb63ed | worried | 2020-11-04 20:08:31 |
| 25552 | 25552 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | 4edc3d1a-a7d9-4db6-89c3-f784d9954172 | cherish | 2021-01-04 04:55:11 |
22534 rows × 5 columns
df2d = df2.drop(columns = ['User ID'])
df2d
| | Unnamed: 0 | Content ID | Type | Datetime |
|---|---|---|---|---|
| 1 | 1 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | disgust | 2020-11-07 09:43:50 |
| 2 | 2 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | dislike | 2021-06-17 12:22:51 |
| 3 | 3 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | scared | 2021-04-18 05:13:58 |
| 4 | 4 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | disgust | 2021-01-06 19:13:01 |
| 5 | 5 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | interested | 2020-08-23 12:25:58 |
| ... | ... | ... | ... | ... |
| 25547 | 25547 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | worried | 2020-10-31 04:50:14 |
| 25548 | 25548 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | dislike | 2020-06-27 09:46:48 |
| 25549 | 25549 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | intrigued | 2021-02-16 17:17:02 |
| 25551 | 25551 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | worried | 2020-11-04 20:08:31 |
| 25552 | 25552 | 75d6b589-7fae-4a6d-b0d0-752845150e56 | cherish | 2021-01-04 04:55:11 |
22534 rows × 4 columns
#let's rename the Type column:
df2d.rename(columns = {'Type':'ReactionType'}, inplace = True)
df2d.head(5)
| | Unnamed: 0 | Content ID | ReactionType | Datetime |
|---|---|---|---|---|
| 1 | 1 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | disgust | 2020-11-07 09:43:50 |
| 2 | 2 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | dislike | 2021-06-17 12:22:51 |
| 3 | 3 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | scared | 2021-04-18 05:13:58 |
| 4 | 4 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | disgust | 2021-01-06 19:13:01 |
| 5 | 5 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | interested | 2020-08-23 12:25:58 |
#still following the same process we used in the other tables
df3 = pd.read_csv('ReactionTypes.csv')
df3
| | Unnamed: 0 | Type | Sentiment | Score |
|---|---|---|---|---|
| 0 | 0 | heart | positive | 60 |
| 1 | 1 | want | positive | 70 |
| 2 | 2 | disgust | negative | 0 |
| 3 | 3 | hate | negative | 5 |
| 4 | 4 | interested | positive | 30 |
| 5 | 5 | indifferent | neutral | 20 |
| 6 | 6 | love | positive | 65 |
| 7 | 7 | super love | positive | 75 |
| 8 | 8 | cherish | positive | 70 |
| 9 | 9 | adore | positive | 72 |
| 10 | 10 | like | positive | 50 |
| 11 | 11 | dislike | negative | 10 |
| 12 | 12 | intrigued | positive | 45 |
| 13 | 13 | peeking | neutral | 35 |
| 14 | 14 | scared | negative | 15 |
| 15 | 15 | worried | negative | 12 |
df3.rename(columns = {'Type':'ReactionType'}, inplace = True)
df3.head(5)
| | Unnamed: 0 | ReactionType | Sentiment | Score |
|---|---|---|---|---|
| 0 | 0 | heart | positive | 60 |
| 1 | 1 | want | positive | 70 |
| 2 | 2 | disgust | negative | 0 |
| 3 | 3 | hate | negative | 5 |
| 4 | 4 | interested | positive | 30 |
#Let's take a look at the tables we are working with again:
df1d.head(5)#content
| | Unnamed: 0 | Content ID | ContentType | Category |
|---|---|---|---|---|
| 0 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying |
| 1 | 1 | 9f737e0a-3cdd-4d29-9d24-753f4e3be810 | photo | healthy eating |
| 2 | 2 | 230c4e4d-70c3-461d-b42c-ec09396efb3f | photo | healthy eating |
| 3 | 3 | 356fff80-da4d-4785-9f43-bc1261031dc6 | photo | technology |
| 4 | 4 | 01ab84dd-6364-4236-abbb-3f237db77180 | video | food |
df2d.head(5)#reaction
| | Unnamed: 0 | Content ID | ReactionType | Datetime |
|---|---|---|---|---|
| 1 | 1 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | disgust | 2020-11-07 09:43:50 |
| 2 | 2 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | dislike | 2021-06-17 12:22:51 |
| 3 | 3 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | scared | 2021-04-18 05:13:58 |
| 4 | 4 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | disgust | 2021-01-06 19:13:01 |
| 5 | 5 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | interested | 2020-08-23 12:25:58 |
df3.head(5)#reactiontype
| | Unnamed: 0 | ReactionType | Sentiment | Score |
|---|---|---|---|---|
| 0 | 0 | heart | positive | 60 |
| 1 | 1 | want | positive | 70 |
| 2 | 2 | disgust | negative | 0 |
| 3 | 3 | hate | negative | 5 |
| 4 | 4 | interested | positive | 30 |
first_merge = pd.merge(df1d, df2d, on = 'Content ID')
first_merge.head(5)
| | Unnamed: 0_x | Content ID | ContentType | Category | Unnamed: 0_y | ReactionType | Datetime |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | 1 | disgust | 2020-11-07 09:43:50 |
| 1 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | 2 | dislike | 2021-06-17 12:22:51 |
| 2 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | 3 | scared | 2021-04-18 05:13:58 |
| 3 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | 4 | disgust | 2021-01-06 19:13:01 |
| 4 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | 5 | interested | 2020-08-23 12:25:58 |
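Before moving on, it is worth noting that pd.merge can check the expected key relationship with its validate parameter, so an accidental duplicate key cannot silently multiply rows. A minimal sketch on toy tables (the IDs 'c1'/'c2' are illustrative, not from the dataset), assuming each Content ID is unique on the content side:

```python
import pandas as pd

# Toy stand-ins for the content and reactions tables
content = pd.DataFrame({'Content ID': ['c1', 'c2'],
                        'Category':   ['studying', 'food']})
reactions = pd.DataFrame({'Content ID': ['c1', 'c1', 'c2'],
                          'ReactionType': ['like', 'heart', 'want']})

# validate='one_to_many' raises an error if a Content ID is duplicated
# on the content side, instead of quietly producing extra rows
merged = pd.merge(content, reactions, on='Content ID', validate='one_to_many')
print(len(merged))  # → 3
```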
#we are going to deal with the 'Unnamed' columns later in this project. But before then, we are going to combine the first merge with the ReactionTypes table:
second_merge = pd.merge(first_merge, df3, on = 'ReactionType')
second_merge
| | Unnamed: 0_x | Content ID | ContentType | Category | Unnamed: 0_y | ReactionType | Datetime | Unnamed: 0 | Sentiment | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | 1 | disgust | 2020-11-07 09:43:50 | 2 | negative | 0 |
| 1 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | 4 | disgust | 2021-01-06 19:13:01 | 2 | negative | 0 |
| 2 | 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | 35 | disgust | 2021-04-09 02:46:20 | 2 | negative | 0 |
| 3 | 1 | 9f737e0a-3cdd-4d29-9d24-753f4e3be810 | photo | healthy eating | 52 | disgust | 2021-03-28 21:15:26 | 2 | negative | 0 |
| 4 | 2 | 230c4e4d-70c3-461d-b42c-ec09396efb3f | photo | healthy eating | 88 | disgust | 2020-08-04 05:40:33 | 2 | negative | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22529 | 997 | 435007a5-6261-4d8b-b0a4-55fdc189754b | audio | veganism | 25489 | adore | 2020-10-04 22:26:33 | 9 | positive | 72 |
| 22530 | 997 | 435007a5-6261-4d8b-b0a4-55fdc189754b | audio | veganism | 25491 | adore | 2020-09-18 10:50:50 | 9 | positive | 72 |
| 22531 | 998 | 4e4c9690-c013-4ee7-9e66-943d8cbd27b7 | GIF | culture | 25512 | adore | 2020-10-31 03:58:44 | 9 | positive | 72 |
| 22532 | 998 | 4e4c9690-c013-4ee7-9e66-943d8cbd27b7 | GIF | culture | 25524 | adore | 2020-06-25 15:12:29 | 9 | positive | 72 |
| 22533 | 998 | 4e4c9690-c013-4ee7-9e66-943d8cbd27b7 | GIF | culture | 25531 | adore | 2020-12-17 16:32:57 | 9 | positive | 72 |
22534 rows × 10 columns
#all the columns that are 'Unnamed' should be dropped using the .drop() command:
second_merge.drop(columns = ["Unnamed: 0_x","Unnamed: 0_y","Unnamed: 0"],axis = 1,inplace = True)
second_merge
| | Content ID | ContentType | Category | ReactionType | Datetime | Sentiment | Score |
|---|---|---|---|---|---|---|---|
| 0 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | disgust | 2020-11-07 09:43:50 | negative | 0 |
| 1 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | disgust | 2021-01-06 19:13:01 | negative | 0 |
| 2 | 97522e57-d9ab-4bd6-97bf-c24d952602d2 | photo | studying | disgust | 2021-04-09 02:46:20 | negative | 0 |
| 3 | 9f737e0a-3cdd-4d29-9d24-753f4e3be810 | photo | healthy eating | disgust | 2021-03-28 21:15:26 | negative | 0 |
| 4 | 230c4e4d-70c3-461d-b42c-ec09396efb3f | photo | healthy eating | disgust | 2020-08-04 05:40:33 | negative | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 22529 | 435007a5-6261-4d8b-b0a4-55fdc189754b | audio | veganism | adore | 2020-10-04 22:26:33 | positive | 72 |
| 22530 | 435007a5-6261-4d8b-b0a4-55fdc189754b | audio | veganism | adore | 2020-09-18 10:50:50 | positive | 72 |
| 22531 | 4e4c9690-c013-4ee7-9e66-943d8cbd27b7 | GIF | culture | adore | 2020-10-31 03:58:44 | positive | 72 |
| 22532 | 4e4c9690-c013-4ee7-9e66-943d8cbd27b7 | GIF | culture | adore | 2020-06-25 15:12:29 | positive | 72 |
| 22533 | 4e4c9690-c013-4ee7-9e66-943d8cbd27b7 | GIF | culture | adore | 2020-12-17 16:32:57 | positive | 72 |
22534 rows × 7 columns
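The 'Unnamed: 0' columns appeared because the CSVs were exported with the DataFrame index saved as an unnamed first column. A minimal sketch (using an inline two-row stand-in for the real files, which are not reproduced here) of how passing index_col=0 to read_csv avoids them in the first place:

```python
import io
import pandas as pd

# Two-row stand-in for one of the exported CSVs; the first column
# (before 'Content ID') is the saved index with no header name
csv_text = ",Content ID,Type\n0,abc-123,photo\n1,def-456,video\n"

# index_col=0 consumes the saved index instead of creating 'Unnamed: 0'
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df.columns))  # → ['Content ID', 'Type']
```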
#lets check the unique categories:
second_merge['Category'].unique()
array(['studying', 'healthy eating', 'dogs', 'public speaking', 'science',
'tennis', 'food', 'fitness', 'soccer', 'education', 'travel',
'veganism', 'cooking', 'technology', 'animals', 'culture'],
dtype=object)
second_merge['Category'].value_counts()
animals            1738
science            1646
healthy eating     1572
technology         1557
food               1556
culture            1538
cooking            1525
travel             1510
soccer             1339
education          1311
fitness            1284
studying           1251
dogs               1227
tennis             1218
veganism           1146
public speaking    1116
Name: Category, dtype: int64
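The counts above rank categories by number of reactions. Since each reaction also carries a Score, an alternative notion of aggregate popularity is the per-category score sum. A minimal sketch on illustrative toy rows (the column names match the merged table; the values are not from the dataset):

```python
import pandas as pd

# Toy merged table with the columns used in this notebook
df = pd.DataFrame({
    'Category': ['animals', 'animals', 'science', 'food'],
    'Score':    [72,        50,        60,        70],
})

# Sum of reaction scores per category, largest first — a score-weighted
# alternative to ranking by plain reaction counts
ranking = df.groupby('Category')['Score'].sum().sort_values(ascending=False)
print(ranking.head(5))
```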
#now let's visualize the categories based on their counts in the data set using a bar chart:
sns.countplot(y = 'Category',data = second_merge, color = "darkgreen")
<AxesSubplot:xlabel='count', ylabel='Category'>
#we can also visualize the breakdown of the categories in percentage terms using Plotly; with values = 'Score', each slice is weighted by the category's total reaction score:
#To import Plotly:
import plotly.express as px
fig1 = px.pie(second_merge, names = 'Category', values = 'Score')
fig1.show()
#We'll take a look at the top 5 categories:
t5 = second_merge['Category'].value_counts().head(5)
t5
animals           1738
science           1646
healthy eating    1572
technology        1557
food              1556
Name: Category, dtype: int64
#we are going to convert the result to a dataframe:
t5a = t5.reset_index()
t5a
| | index | Category |
|---|---|---|
| 0 | animals | 1738 |
| 1 | science | 1646 |
| 2 | healthy eating | 1572 |
| 3 | technology | 1557 |
| 4 | food | 1556 |
#We can rename the columns, labelling the counts 'Count' since these are reaction counts rather than scores:
t5b = t5a.rename(columns = {'index':'Category','Category':'Count'})
t5b
| | Category | Count |
|---|---|---|
| 0 | animals | 1738 |
| 1 | science | 1646 |
| 2 | healthy eating | 1572 |
| 3 | technology | 1557 |
| 4 | food | 1556 |
#the share of each of the top 5 categories can also be visualized using Plotly:
fig2 = px.pie(t5b, names = 'Category', values = 'Count')
fig2.show()
second_merge['Datetime'].value_counts()
2021-01-07 14:49:14 2
2020-06-27 06:28:56 2
2020-12-13 17:37:25 2
2020-08-10 18:01:52 2
2020-09-11 05:52:04 2
..
2020-11-16 09:44:42 1
2020-08-30 08:00:42 1
2021-03-15 00:15:46 1
2021-05-03 04:36:19 1
2020-12-17 16:32:57 1
Name: Datetime, Length: 22524, dtype: int64
months = pd.DatetimeIndex(second_merge['Datetime']).month.value_counts()
months
5     1954
1     1949
8     1945
12    1941
10    1889
7     1884
11    1866
9     1862
3     1857
6     1836
4     1801
2     1750
Name: Datetime, dtype: int64
pd.DatetimeIndex(second_merge['Datetime']).month.value_counts().nlargest(1)
5    1954
Name: Datetime, dtype: int64
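Month 5 corresponds to May, the busiest month for reactions. For more readable output, the Datetime column can be parsed with pd.to_datetime and labeled via .dt.month_name(); a minimal sketch on three illustrative timestamps:

```python
import pandas as pd

# Toy Datetime values in the same format as the Reactions table
dates = pd.to_datetime(pd.Series(['2021-05-03 04:36:19',
                                  '2021-05-14 12:00:00',
                                  '2020-12-17 16:32:57']))

# Readable month labels instead of bare month numbers
busiest = dates.dt.month_name().value_counts().idxmax()
print(busiest)  # → May
```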
The outcome of this project can help the company make informed decisions to improve its content offerings, prepare for a successful IPO, and establish a strong foundation for handling user data responsibly. Successful implementation of these insights can lead to increased user engagement, investor interest, and regulatory compliance.